[Objective] This paper discusses the classical entity resolution methods and logical thinking in entity resolution theory. [Coverage] Google Scholar and CNKI were respectively used to search literatures with the keywords “Entity Resolution”, “Collective Analysis”, “Crowdsourced”, “Active Learning”, “Privacy-Preserving” and “Entity Resolution” in Chinese. I then obtained a total of 86 representative literatures in conjunction with topic screening, intensive reading and retrospective method. [Methods] For each entity resolution method, the paper first summarizes and analyzes the basic idea of the method, and presents the resolution process through illustration, and then focuses on analyzing the key strategies, algorithms or techniques adopted by the existing research in the process of implementation of the method. [Results] Entity resolution is the basic operation of data quality management, and the key step to find the value of data. [Limitations] There is no in-depth analysis of the evaluation indicators and application of each entity resolution method. [Conclusions] Although existing entity resolution methods can meet the requirements of most applications to some extent, they still face challenges in data heterogeneity, privacy protection and distributed environment in the big data environment.
(Meng Xiaofeng, Du Zhijuan.Research on the Big Data Fusion: Issues and Challenges[J]. Journal of Computer Research and Development, 2016, 53(2): 231-246.)
(Li Jianzhong, Wang Hongzhi, Gao Hong.State-of-the-Art of Research on Big Data Usability[J]. Journal of Software, 2016, 27(7): 1605-1625.)
[3]
Dong X L, Srivastava D.Big Data Integration[C]// Proceedings of the 29th International Conference on Data Engineering. 2013: 1245-1248.
[4]
Getoor L, Machanavajjhala A.Entity Resolution: Theory, Practice & Open Challenges[J]. Proceedings of the VLDB Endowment, 2012, 5(12): 2018-2019.
[5]
Dong X L, Rekatsinas T.Data Integration and Machine Learning: A Natural Synergy[C]// Proceedings of the 2018 International Conference on Management of Data. 2018: 1645-1650.
[6]
Newcombe H B, Kennedy J M, Axford S J, et al.Automatic Linkage of Vital Records[J]. Science, 1959, 130(3381): 954-959.
[7]
Talburt J R.Entity Resolution and Information Quality[M]. Elsevier, 2011.
[8]
Fellegi I P, Sunter A B.A Theory for Record Linkage[J]. Journal of the American Statistical Association, 1969, 64(328): 1183-1210.
[9]
Cohen W W, Ravikumar P, Fienberg S E.A Comparison of String Metrics for Matching Names and Records[A]// KDD Workshop on Data Cleaning & Object Consolidation[M]. 2003, 4(2): 73-78.
[10]
Naumann F, Herschel M.An Introduction to Duplicate Detection[J]. Synthesis Lectures on Data Management, 2010, 2(1): 1-87.
[11]
Doan A, Halevy A, Ives Z.Principles of Data Integration[M]. Elsevier, 2012.
[12]
Levenshtein V I.Binary Codes Capable of Correcting Deletions, Insertions, and Reversals[J]. Doklady Akademii Nauk SSSR, 1965, 163(4): 845-848.
[13]
Jaro M A.Advances in Record-Linkage Methodology as Applied to Matching the 1985 Census of Tampa, Florida[J]. Journal of the American Statistical Association, 1989, 84(406): 414-420.
[14]
Winkler W E, Thibaudeau Y.An Application of the Felligi Sunter Model of Record Linkage to the 1990 US Decennial Census[R]. Washington: US Bureau of the Census, 1991.
[15]
On B W, Koudas N, Lee D, et al.Group Linkage[C]// Proceedings of the 23rd International Conference on Data Engineering. 2007: 496-505.
[16]
Bilenko M, Mooney R J, Cohen W W, et al.Adaptive Name Matching in Information Integration[J]. IEEE Intelligent Systems, 2003, 18(5): 16-23.
[17]
Cohen W W, Ravikumar P, Fienberg S E.A Comparison of String Distance Metrics for Name-Matching Tasks[C]// Proceedings of the 2003 International Joint Conference on Artificial Intelligence. 2003: 73-78.
[18]
Monge A E, Elkan C.The Field Matching Problem: Algorithms and Applications[C]// Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining. 1996: 267-270.
[19]
Whang S E, Garcia-Molina H.Developments in Generic Entity Resolution[A]// Bulletin of the Technology Committee on Data Engineering[M]. IEEE Computer Society, 2011.
[20]
Benjelloun O, Garcia-Molina H, Kawai H, et al.Generic Entity Resolution in the Serf Project[R]. Stanford InfoLab, 2006.
[21]
Benjelloun O, Garcia-Molina H, Menestrina D, et al.Swoosh: A Generic Approach to Entity Resolution[J]. The VLDB Journal, 2009, 18(1): 255-276.
[22]
Kawai H, Garcia-Molina H, Benjelloun O, et al.P-Swoosh: Parallel Algorithm for Generic Entity Resolution[R]. 2006.
[23]
Kim H S, Lee D.Parallel Linkage[C]// Proceedings of the 16th ACM Conference on Information and Knowledge Management. 2007: 283-292.
[24]
Kawai H, Garcia-Molina H, Benjelloun O, et al.Bufoosh: Buffering Algorithms for Generic Entity Resolution[R]. 2006.
[25]
Benjelloun O, Garcia-Molina H, Gong H, et al.D-Swoosh: A Family of Algorithms for Generic, Distributed Entity Resolution[C]// Proceedings of the 27th International Conference on Distributed Computing Systems. IEEE, 2007.
[26]
Bhattacharya I, Getoor L.Collective Entity Resolution in Relational Data[J]. ACM Transactions on Knowledge Discovery from Data, 2007, 1(1): 5-18.
[27]
Malin B, Airoldi E, Carley K M.A Network Analysis Model for Disambiguation of Names in Lists[J]. Computational & Mathematical Organization Theory, 2005, 11(2): 119-139.
(Gao Xuedong, Huang Yue.Co-Clustering Algorithm for Collective Entity Resolution of Multi-Typed Objects[J]. Systems Engineering - Theory & Practice, 2015, 35(4): 997-1004.)
[29]
Dong X, Halevy A, Madhavan J.Reference Reconciliation in Complex Information Spaces[C]// Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data. 2005: 85-96.
(Sun Chenchen, Shen Derong, Kou Yue, et al.A Related Data Oriented Joint Entity Resolution Approach[J]. Chinese Journal of Computers, 2015, 38(9): 1739-1754.)
[31]
Kalashnikov D V.Domain-Independent Data Cleaning via Analysis of Entity-Relationship Graph[J]. ACM Transactions on Database Systems, 2006, 31(2): 716-767.
(Feng Jianhong, Li Guoliang, Feng Jianhua.A Survey on Crowdsourcing[J]. Chinese Journal of Computers, 2015, 38(9): 1713-1726.)
[33]
Lee J, Cho H, Park J W, et al.Hybrid Entity Clustering Using Crowds and Data[J]. The VLDB Journal, 2013, 22(5): 711-726.
[34]
Wang J, Kraska T, Franklin M J, et al.CrowdER: Crowdsourcing Entity Resolution[J]. Proceedings of the VLDB Endowment, 2012, 5(11): 1483-1494.
[35]
Wang S, Xiao X, Lee C H.Crowd-Based Deduplication: An Adaptive Approach[C]// Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. 2015: 1263-1277.
[36]
Demartini G, Difallah D E, Cudré-Mauroux P, et al.ZenCrowd: Leveraging Probabilistic Reasoning and Crowdsourcing Techniques for Large-Scale Entity Linking[C]// Proceedings of the 21st International Conference on World Wide Web. 2012: 469-478.
[37]
Liu X, Lu M, Ooi B C, et al.CDAS: A Crowdsourcing Data Analytics System[J]. Proceedings of the VLDB Endowment, 2012, 5(10): 1040-1051.
[38]
Settles B.Active Learning Literature Survey[R]. University of Wisconsinmadison, 2009.
(Yang Wenzhu, Tian Xiaoxiao, Wang Sile, et al.Recent Advances in Active Learning Algorithms[J]. Journal of Hebei University: Natural Science Edition, 2017, 37(2): 216-224.)
[40]
Sarawagi S, Bhamidipaty A.Interactive Deduplication Using Active Learning[C]// Proceeding of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2002: 269-278.
[41]
Tejada S, Knoblock C A, Minton S.Learning Domain-Independent String Transformation Weights for High Accuracy Object Identification[C]// Proceeding of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2002: 350-359.
[42]
Bellare K, Iyengar S, Parameswaran A G, et al.Active Sampling for Entity Matching[C]// Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2012: 1131-1139.
[43]
Arasu A, Kaushik R.On Active Learning of Record Matching Packages[C]// Proceeding of the 10th ACM SIGMOD International Conference on Management of Data. 2010: 783-794.
[44]
Qian K, Popa L, Sen P.Active Learning for Large-Scale Entity Resolution[C]// Proceedings of the 2017 ACM Conference on Information and Knowledge Management. 2017: 1379-1388.
[45]
Fisher J, Christen P, Wang Q.Active Learning Based Entity Resolution Using Markov Logic[C]// Proceedings of the 2016 Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, 2016: 338-349.
[46]
Fu Y, Zhu X, Li B.A Survey on Instance Selection for Active Learning[J]. Knowledge and Information Systems, 2013, 35(2): 249-283.
[47]
Ramadan B.Indexing Techniques for Real-Time Entity Resolution[D]. Canberra: The Australian National University, 2016.
[48]
Ramadan B, Christen P, Liang H, et al. Dynamic Sorted Neighborhood Indexing for Real-Time Entity Resolution[J]. Journal of Data and Information Quality (JDIQ), 2015, 6(4): Article No.15.
[49]
Bayardo R J, Ma Y, Srikant R.Scaling up All Pairs Similarity Search[C]// Proceedings of the 16th International Conference on World Wide Web. 2007: 131-140.
[50]
Broder A Z, Carmel D, Herscovici M, et al.Efficient Query Evaluation Using a Two-Level Retrieval Process[C]// Proceedings of the 12th International Conference on Information and Knowledge Management. 2003: 426-434.
[51]
Zobel J, Moffat A.Inverted Files for Text Search Engines[J]. ACM Computing Surveys(CSUR), 2006, 38(2): 6.
[52]
Christen P.A Survey of Indexing Techniques for Scalable Record Linkage and Deduplication[J]. IEEE Transactions on Knowledge and Data Engineering, 2012, 24(9): 1537-1555.
[53]
Christen P, Gayler R, Hawking D.Similarity-Aware Indexing for Real-Time Entity Resolution[C]// Proceedings of the 18th ACM Conference on Information and Knowledge Management. 2009: 1565-1568.
[54]
Ramadan B, Christen P, Liang H, et al.Dynamic Similarity-Aware Inverted Indexing for Real-Time Entity Resolution[C]// Proceedings of the 2013 Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, 2013: 47-58.
[55]
Ramadan B, Christen P.Unsupervised Blocking Key Selection for Real-Time Entity Resolution[C]// Proceedings of the 2015 Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, 2015: 574-585.
[56]
Papadakis G, Koutrika G, Palpanas T, et al.Meta-Blocking: Taking Entity Resolutionto the Next Level[J]. IEEE Transactions on Knowledge and Data Engineering, 2014, 26(8): 1946-1960.
[57]
Efthymiou V, Papadakis G, Papastefanatos G, et al.Parallel Meta-Blocking: Realizing Scalable Entity Resolution over Large, Heterogeneous Data[C]// Proceedings of the 2015 IEEE International Conference on Big Data. 2015: 411-420.
[58]
Dragut E C, Ouzzani M, Elmagarmid A K.Query-Time Record Linkage and Fusion over Web Databases[C]// Proceedings of the 31st International Conference on Data Engineering. IEEE, 2015: 42-53.
[59]
Gruenheid A, Dong X L, Srivastava D.Incremental Record Linkage[J]. Proceedings of the VLDB Endowment, 2014, 7(9): 697-708.
[60]
Whang S E, Marmaros D, Garcia-Molina H.Pay-as-you-go Entity Resolution[J]. IEEE Transactions on Knowledge and Data Engineering, 2013, 25(5): 1111-1124.
[61]
Hall R, Fienberg S E.Privacy-Preserving Record Linkage[C]// Proceedings of the 2013 International Conference on Privacy in Statistical Databases. 2010: 269-283.
(Han Shumin, Shen Derong, Nie Tiezheng, et al.Multi-Party Privacy-Preserving Record Linkage Apprpoach[J]. Journal of Software, 2017, 28(9): 2281-2292.)
[63]
Kuzu M, Kantarcioglu M, Inan A, et al.Efficient Privacy-Aware Record Integration[C]// Proceedings of the 16th International Conference on Extending Database Technology. ACM, 2013: 167-178.
[64]
Sweeney L.K-Anonymity: A Model for Protecting Privacy[J]. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 2002, 10(5): 557-570.
[65]
Sweeney L.Achieving K-Anonymity Privacy Protection Using Generalization and Suppression[J]. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 2002, 10(5): 571-588.
[66]
Kantarcioglu M, Jiang W, Malin B.A Privacy-Preserving Framework for Integrating Person-Specific Databases[C]// Proceedings of the 2008 International Conference on Privacy in Statistical Databases. Springer, 2008: 298-314.
[67]
Dwork C.Differential Privacy: A Survey of Results[C]// Proceedings of the 5th International Conference on Theory and Applications of Models of Computation. 2008: 1-19.
[68]
Inan A, Kantarcioglu M, Ghinita G, et al.Private Record Matching Using Differential Privacy[C]// Proceedings of the 13th International Conference on Extending Database Technology. ACM, 2010: 123-134.
(Meng Xiaofeng, Zhang Xiaojian.Big Data Privacy Management[J]. Journal of Computer Research and Development, 2015, 52(2): 265-281.)
[71]
Schnell R, Bachteler T, Reiher J.Privacy-Preserving Record Linkage Using Bloom Filters[J]. BMC Medical Informatics and Decision Making, 2009, 9(1): 41.
[72]
Bloom B H.Space/Time Trade-offs in Hash Coding with Allowable Errors[J]. Communications of the ACM, 1970, 13(7): 422-426.
[73]
Jain N, Dahlin M, Tewari R.Using Bloom Filters to Refine Web Search Results[C]// Proceedings of the 8th International Workshop on the Web and Databases. 2005: 25-30.
[74]
Durham E A, Kantarcioglu M, Xue Y, et al.Composite Bloom Filters for Secure Record Linkage[J]. IEEE Transactions on Knowledge and Data Engineering, 2014, 26(12): 2956-2968.
[75]
Vatsalan D, Christen P.Scalable Privacy-Preserving Record Linkage for Multiple Databases[C]// Proceedings of the 23rd ACM International Conference on Information and Knowledge Management. ACM, 2014: 1795-1798.
[76]
Randall S M, Ferrante A M, Boyd J H, et al.Privacy- Preserving Record Linkage on Large Real World Datasets[J]. Journal of Biomedical Informatics, 2014, 50: 205-212.
[77]
Jurczyk P, Xiong L.Towards Privacy-Preserving Integration of Distributed Heterogeneous Data[C]// Proceedings of the 2nd PhD Workshop on Information and Knowledge Management. ACM, 2008: 65-72.
[78]
Sheikh R, Mishra D K, Kumar B.Secure Multiparty Computation: From Millionaires Problem to Anonymizer[J]. Information Security Journal: A Global Perspective, 2011, 20(1): 25-33.
[79]
Goldwasser S, Micali S, Rackoff C.The Knowledge Complexity of Interactive Proof Systems[J]. SIAM Journal on Computing, 1989, 18(1): 186-208.
[80]
Quisquater J, Quisquater M, Quisquater M, et al.How to Explain Zero-Knowledge Protocols to Your Children[C]// Proceedings of the 1989 Conference on the Theory and Application of Cryptology. 1989: 628-631.
[81]
Lindell Y, Pinkas B.Secure Multiparty Computation for Privacy-Preserving Data Mining[J]. Journal of Privacy & Confidentiality, 2009, 25(2): 761-766.
(Viktor Mayer-Schonberger, Kenneth Cukier.Big Data: A Revolution That will Transform How We Live, Work, and Think[M]. Hangzhou: Zhejiang People’s Publishing House, 2013.)
[83]
Kooli N, Allesiardo R, Pigneul E.Deep Learning Based Approach for Entity Resolution in Databases[C]// Proceedings of the 2018 Asian Conference on Intelligent Information and Database Systems. 2018: 3-12.
[84]
Schmidhuber J.Deep Learning in Neural Networks: An Overview[J]. Neural Networks, 2015, 61: 85-117.
[85]
Vatsalan D, Sehili Z, Christen P, et al.Privacy-Preserving Record Linkage for Big Data: Current Approaches and Research Challenges[A]// Zomaya A, Sakr S. Handbook of Big Data Technologies[M]. 2017: 851-895.
[86]
Altowim Y, Mehrotra S.Parallel Progressive Approach to Entity Resolution Using MapReduce[C]// Proceedings of the 33rd International Conference on Data Engineering. 2017: 909-920.